Exploring AutoML with the “h2o” package

Introduction

The h2o.automl function in R automates model training and tuning: it preprocesses the data, trains multiple models, and selects the best one. It requires minimal code and little machine learning knowledge, and it covers both classification and regression algorithms.
Algorithms typically used:

  • Generalized Linear Models (GLM)

  • Gradient Boosting Machines (GBM, including XGBoost)

  • Distributed Random Forest (DRF)

  • Deep Neural Networks (Deep Learning)

  • Stacked Ensembles

Note: The “h2o” package makes the commands of the H2O machine learning platform available in R. The actual work is done on an H2O server, meaning no data is stored in R: R sends requests via a REST API, and the server returns JSON responses whose contents are then displayed in R.
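Since the client only sends REST calls, the same command can attach to an already-running H2O server elsewhere; a minimal sketch (the hostname below is a placeholder; ip, port, and startH2O are standard h2o.init arguments):

# Connect to an existing remote H2O instance instead of starting one locally
# (hostname is hypothetical)
h2o.init(ip = "h2o-server.example.com", port = 54321, startH2O = FALSE)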

Let us begin by installing and loading the required package.

Pre-requisite: Download and install the latest Java SE JDK.

install.packages("h2o")
library("h2o")

# Download dependency packages:
pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
}

# Initialize H2O on local machine
h2o.init()

# Check the installed version (version 3.1.0 is incompatible, so verify before proceeding)
versions::installed.versions("h2o")

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\hp\AppData\Local\Temp\RtmpYp20R4\file6fcb18192e/h2o_hp_started_from_r.out
    C:\Users\hp\AppData\Local\Temp\RtmpYp20R4\file6fce0a548d/h2o_hp_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         4 seconds 807 milliseconds 
    H2O cluster timezone:       Asia/Kolkata 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.44.0.3 
    H2O cluster version age:    9 months and 12 days 
    H2O cluster name:           H2O_started_from_R_hp_zvh539 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.96 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.4.1 (2024-06-14 ucrt) 
[1] "3.44.0.3"

A glance at the dataset

Here, we will use a credit dataset, which is used to detect bad loans.
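The import step that creates the H2OFrame “data” is not shown in the output; a minimal sketch of a typical call, assuming the credit data sits in a local CSV file (the path is hypothetical):

# Import the credit data into the H2O cluster as an H2OFrame
# (file path is hypothetical)
data <- h2o.importFile(path = "credit.csv")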



Understanding the dataset

df <- as.data.frame(data)
dim(df)
[1] 6990   25
table(df$bad) # 0: good loan, 1: bad loan

   0    1 
6065  925 

Converting the data types

cols_to_factor <- c("bad", "preloan", "veh", "house", "selfemp", "account", "deposit", "branch", "ref", "age", "gender", "ms", "child", "zone", "emp_catg", "address_catg", "debtinc_catg", "creddebt_catg", "othdebt_catg")

data[cols_to_factor] <- as.factor(data[cols_to_factor])
# h2o.str(data)     # View the structure of dataset

Defining the model arguments

# Response column
y <- "bad"

# Predictor column names
x <- c("debtinc", "creddebt", "othdebt", "preloan", "veh", "house", "selfemp", "account", "deposit", "emp", "address", "branch", "ref", "age", "gender", "ms", "child", "zone", "emp_catg", "address_catg", "debtinc_catg", "creddebt_catg", "othdebt_catg")

Running the model

Case 1: Run AutoML for 10 minutes (600 seconds)

  • max_runtime_secs = 600: the maximum time that the AutoML process will run for.
  • The seed parameter may not ensure reproducibility when using max_runtime_secs because the resources available during each run might differ, leading to variations in the results.
aml <- h2o.automl(x = x,                 # If missing, all variables except y are used
                  y = y,                 # Classification if y is factor; else regression
                  training_frame = data, # Specifies the training set
                  max_runtime_secs = 600,# Default 3600 seconds
                  exclude_algos = c("DeepLearning") # Options: "DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble". Defaults to NULL (uses all algos).
)

09:46:45.695: AutoML: XGBoost is not available; skipping it.
# View the leaderboard
aml@leaderboard
                                                model_id       auc   logloss
1 StackedEnsemble_BestOfFamily_6_AutoML_1_20241003_94645 0.7613174 0.3382096
2 StackedEnsemble_BestOfFamily_7_AutoML_1_20241003_94645 0.7610110 0.3383858
3 StackedEnsemble_BestOfFamily_4_AutoML_1_20241003_94645 0.7606597 0.3382553
4    StackedEnsemble_AllModels_1_AutoML_1_20241003_94645 0.7600385 0.3383988
5    StackedEnsemble_AllModels_5_AutoML_1_20241003_94645 0.7599243 0.3385273
6    StackedEnsemble_AllModels_2_AutoML_1_20241003_94645 0.7596834 0.3386844
      aucpr mean_per_class_error      rmse       mse
1 0.3010454            0.3213903 0.3212951 0.1032306
2 0.2998955            0.3103933 0.3213547 0.1032688
3 0.3008490            0.3180442 0.3214169 0.1033088
4 0.3016766            0.3229745 0.3213194 0.1032462
5 0.3010226            0.3288211 0.3213613 0.1032731
6 0.3018481            0.3136263 0.3213904 0.1032918

[109 rows x 7 columns] 

Checking which algorithms have been used

full_lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
algos <- as.data.frame(full_lb$algo)
print(unique(algos))
              algo
1  StackedEnsemble
13             GLM
14             GBM
76             DRF
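Besides filtering the leaderboard, recent h2o versions also provide h2o.get_best_model to pull out the overall leader or the best model of a particular family; a short sketch:

# Overall best model (the leaderboard leader)
leader <- h2o.get_best_model(aml)

# Best model from a single algorithm family, e.g. the top GBM
best_gbm <- h2o.get_best_model(aml, algorithm = "gbm")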

Case 2: Run AutoML with a limit of 10 base models

  • max_models = n, where n is the number of base models to build in the AutoML process (stacked ensembles are not counted). Defaults to NULL.
  • seed = 1234: with a seed set, AutoML guarantees reproducibility only when the run is constrained by max_models rather than by time-based early stopping.
# ---------------------------------------------------------------
# Splitting data into train and validation sets
splits <- h2o.splitFrame(data = data, ratios = 0.8, seed = 1234)

train_data <- splits[[1]]   # 80% training data
valid_data <- splits[[2]]   # 20% validation data

aml2 <- h2o.automl(x = x,
                  y = y, 
                  training_frame = train_data,    # Use the training frame
                  validation_frame = valid_data,  # Specify the validation frame
                  max_models = 10, 
                  balance_classes = FALSE, # Specify whether to oversample minority classes; Defaults to FALSE.
                  seed = 1234 
                  )

10:00:55.586: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
10:00:55.592: AutoML: XGBoost is not available; skipping it.
# Viewing the best model:
best_model <- aml2@leader
best_model
Model Details:
==============

H2OBinomialModel: stackedensemble
Model ID:  StackedEnsemble_AllModels_1_AutoML_2_20241003_100055 
Model Summary for Stacked Ensemble: 
                                         key            value
1                          Stacking strategy cross_validation
2       Number of base models (used / total)             7/10
3           # GBM base models (used / total)              4/6
4           # DRF base models (used / total)              2/2
5           # GLM base models (used / total)              1/1
6  # DeepLearning base models (used / total)              0/1
7                      Metalearner algorithm              GLM
8         Metalearner fold assignment scheme           Random
9                         Metalearner nfolds                5
10                   Metalearner fold_column               NA
11        Custom metalearner hyperparameters             None


H2OBinomialMetrics: stackedensemble
** Reported on training data. **

MSE:  0.04710521
RMSE:  0.2170374
LogLoss:  0.17241
Mean Per-Class Error:  0.07122993
AUC:  0.9915184
AUCPR:  0.954627
Gini:  0.9830367

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0   1    Error       Rate
0      4842  65 0.013246   =65/4907
1        92 620 0.129213    =92/712
Totals 4934 685 0.027941  =157/5619

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold       value idx
1                       max f1  0.323439    0.887616 155
2                       max f2  0.267866    0.906013 180
3                 max f0point5  0.376246    0.910904 133
4                 max accuracy  0.325849    0.972059 154
5                max precision  0.850609    1.000000   0
6                   max recall  0.108709    1.000000 283
7              max specificity  0.850609    1.000000   0
8             max absolute_mcc  0.323439    0.871882 155
9   max min_per_class_accuracy  0.246020    0.950843 190
10 max mean_per_class_accuracy  0.261254    0.952448 183
11                     max tns  0.850609 4907.000000   0
12                     max fns  0.850609  711.000000   0
13                     max fps  0.002354 4907.000000 399
14                     max tps  0.108709  712.000000 283
15                     max tnr  0.850609    1.000000   0
16                     max fnr  0.850609    0.998596   0
17                     max fpr  0.002354    1.000000 399
18                     max tpr  0.108709    1.000000 283

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: stackedensemble
** Reported on validation data. **

MSE:  0.1085061
RMSE:  0.3294026
LogLoss:  0.345963
Mean Per-Class Error:  0.2683476
AUC:  0.8129465
AUCPR:  0.4306114
Gini:  0.625893

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0   1    Error       Rate
0       966 192 0.165803  =192/1158
1        79 134 0.370892    =79/213
Totals 1045 326 0.197666  =271/1371

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold       value idx
1                       max f1  0.215229    0.497217 167
2                       max f2  0.106424    0.615281 262
3                 max f0point5  0.269611    0.458372 125
4                 max accuracy  0.424735    0.854121  45
5                max precision  0.710278    1.000000   0
6                   max recall  0.007884    1.000000 387
7              max specificity  0.710278    1.000000   0
8             max absolute_mcc  0.215229    0.394225 167
9   max min_per_class_accuracy  0.152360    0.727700 218
10 max mean_per_class_accuracy  0.142857    0.738847 225
11                     max tns  0.710278 1158.000000   0
12                     max fns  0.710278  212.000000   0
13                     max fps  0.002443 1158.000000 399
14                     max tps  0.007884  213.000000 387
15                     max tnr  0.710278    1.000000   0
16                     max fnr  0.710278    0.995305   0
17                     max fpr  0.002443    1.000000 399
18                     max tpr  0.007884    1.000000 387

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.09546108
RMSE:  0.3089678
LogLoss:  0.3107406
Mean Per-Class Error:  0.2829704
AUC:  0.8041598
AUCPR:  0.3413911
Gini:  0.6083195

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0    1    Error        Rate
0      3846 1061 0.216222  =1061/4907
1       249  463 0.349719    =249/712
Totals 4095 1524 0.233138  =1310/5619

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold       value idx
1                       max f1  0.176402    0.414132 215
2                       max f2  0.123801    0.572752 261
3                 max f0point5  0.273286    0.384719 147
4                 max accuracy  0.577190    0.874533  23
5                max precision  0.721615    0.666667   1
6                   max recall  0.005837    1.000000 391
7              max specificity  0.786416    0.999796   0
8             max absolute_mcc  0.123801    0.327319 261
9   max min_per_class_accuracy  0.148034    0.726106 239
10 max mean_per_class_accuracy  0.123801    0.740592 261
11                     max tns  0.786416 4906.000000   0
12                     max fns  0.786416  711.000000   0
13                     max fps  0.001792 4907.000000 399
14                     max tps  0.005837  712.000000 391
15                     max tnr  0.786416    0.999796   0
16                     max fnr  0.786416    0.998596   0
17                     max fpr  0.001792    1.000000 399
18                     max tpr  0.005837    1.000000 391

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary: 
                mean        sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
accuracy    0.773577  0.058581   0.751489   0.848718   0.777285   0.799823
auc         0.804522  0.028432   0.803688   0.835867   0.788468   0.828019
err         0.226423  0.058581   0.248511   0.151282   0.222715   0.200177
err_count 252.200000 54.476600 292.000000 177.000000 251.000000 226.000000
f0point5    0.356544  0.049552   0.353503   0.395809   0.319321   0.415713
          cv_5_valid
accuracy    0.690570
auc         0.766569
err         0.309430
err_count 315.000000
f0point5    0.298372

---
                        mean        sd cv_1_valid cv_2_valid cv_3_valid
precision           0.322305  0.052393   0.315341   0.373626   0.286232
r2                  0.137534  0.026650   0.147860   0.151705   0.121886
recall              0.651441  0.094227   0.685185   0.519084   0.593985
residual_deviance 696.627200 45.698280 766.717040 644.703800 687.270750
rmse                0.308583  0.012018   0.318258   0.290423   0.302323
specificity         0.790065  0.076651   0.762093   0.890279   0.801811
                  cv_4_valid cv_5_valid
precision           0.377163   0.259162
r2                  0.166646   0.099573
recall              0.703226   0.755725
residual_deviance 709.805660 674.638730
rmse                0.314171   0.317741
specificity         0.815195   0.680947
# Viewing the complete leaderboard (all rows and columns)
full_lb2 <- h2o.get_leaderboard(aml2, extra_columns = "ALL")
print(head(full_lb2, n = 100))
                                                  model_id       auc   logloss
1     StackedEnsemble_AllModels_1_AutoML_2_20241003_100055 0.8041598 0.3107406
2  StackedEnsemble_BestOfFamily_1_AutoML_2_20241003_100055 0.8035584 0.3107232
3                           GBM_2_AutoML_2_20241003_100055 0.7969820 0.3181950
4                           GBM_3_AutoML_2_20241003_100055 0.7956907 0.3193538
5                           GBM_1_AutoML_2_20241003_100055 0.7906475 0.3184349
6                           GBM_5_AutoML_2_20241003_100055 0.7899660 0.3209791
7                           XRT_1_AutoML_2_20241003_100055 0.7838551 0.3236697
8                           DRF_1_AutoML_2_20241003_100055 0.7817283 0.3471622
9              GBM_grid_1_AutoML_2_20241003_100055_model_1 0.7783961 0.3339926
10                          GBM_4_AutoML_2_20241003_100055 0.7712734 0.3375627
11                          GLM_1_AutoML_2_20241003_100055 0.7695979 0.3257758
12                 DeepLearning_1_AutoML_2_20241003_100055 0.7236684 0.3551143
       aucpr mean_per_class_error      rmse        mse training_time_ms
1  0.3413911            0.2829704 0.3089678 0.09546108             2464
2  0.3420521            0.2980535 0.3089722 0.09546383             2092
3  0.3295733            0.3009864 0.3125086 0.09766160             1554
4  0.3295498            0.2958531 0.3128368 0.09786684              429
5  0.3226722            0.2963379 0.3121005 0.09740675              461
6  0.3291834            0.2892274 0.3130270 0.09798590              264
7  0.3081259            0.2924907 0.3133731 0.09820268              903
8  0.3170896            0.3149846 0.3126784 0.09776778             1380
9  0.2962752            0.2901390 0.3186610 0.10154485              285
10 0.2842441            0.3174026 0.3205027 0.10272196              488
11 0.2993501            0.3255569 0.3156341 0.09962492              215
12 0.2400872            0.3339925 0.3255017 0.10595138              603
   predict_time_per_row_ms            algo
1                 0.026530 StackedEnsemble
2                 0.016157 StackedEnsemble
3                 0.004990             GBM
4                 0.004476             GBM
5                 0.006429             GBM
6                 0.004617             GBM
7                 0.005900             DRF
8                 0.006899             DRF
9                 0.005292             GBM
10                0.004871             GBM
11                0.001973             GLM
12                0.002916    DeepLearning

The output gives 12 rows (10 base models + Stacked Ensemble Best of Family + Stacked Ensemble All Models).

  • “Stacked Ensemble Best of Family”: This ensemble is built using only the best-performing model from each algorithm family (e.g., the best GLM, the best GBM, etc.).

  • “Stacked Ensemble All Models”: This ensemble uses predictions from all base models, regardless of their performance.
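The weights a stacked ensemble assigns to its base models live in its GLM metalearner, which can be inspected directly; a sketch assuming the model-slot layout of recent h2o versions:

# Retrieve the ensemble's metalearner (a GLM) and view its coefficients,
# which indicate how strongly each base model contributes
metalearner <- h2o.getModel(best_model@model$metalearner$name)
h2o.coef(metalearner)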

Let’s look at some operations to perform on test data

We will use the “valid_data” created earlier as test data for best_model.

# Store model performance
performance <- h2o.performance(best_model, newdata = valid_data)

# ROC curve
# Extract the TPR (True Positive Rate) and FPR (False Positive Rate) from the model performance
fpr <- performance@metrics$thresholds_and_metric_scores$fpr
tpr <- performance@metrics$thresholds_and_metric_scores$tpr

# Get the AUC value
auc_value <- h2o.auc(performance)

# Plotting the ROC Curve using base R
plot(fpr, tpr, type = "l", col = "blue", lwd = 2, xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")  # Add a diagonal line for reference
# Display the AUC value on the plot
text(x = 0.6, y = 0.2, labels = paste("AUC =", round(auc_value, 3)), col = "black", cex = 1.2, font = 2)
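Alternatively, the stored performance object can be plotted directly, since h2o provides a plot method for binomial metrics:

# Built-in ROC curve from the stored performance object
plot(performance, type = "roc")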

# Confusion Matrix
h2o.confusionMatrix(performance)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.21522863095099:
          0   1    Error       Rate
0       966 192 0.165803  =192/1158
1        79 134 0.370892    =79/213
Totals 1045 326 0.197666  =271/1371
# Variable Importance:
h2o.varimp(aml2@leader) # Returns NULL here: the leader is a stacked ensemble, which does not expose variable importance
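Variable importance is available from the tree-based base models, though; a sketch that pulls the top GBM off full_lb2 (row 3 in this run) and queries it:

# Fetch a tree-based model from the leaderboard and get its importance
gbm_top <- h2o.getModel(as.data.frame(full_lb2)$model_id[3])  # GBM_2 in this run
h2o.varimp(gbm_top)         # variable importance table
# h2o.varimp_plot(gbm_top)  # same information as a bar chart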

# Model Predictions:
predictions <- h2o.predict(aml2@leader, newdata = valid_data) 

# Convert H2OFrame to R data frame for manipulation
predictions_df <- as.data.frame(predictions)

# Get the optimal threshold based on the Accuracy
optimal_threshold <- h2o.find_threshold_by_max_metric(performance, "accuracy") # precision, f1, recall, etc.
print(optimal_threshold)
[1] 0.4247352
# Apply the optimal cutoff to create new binary predictions
predictions_df$custom_prediction <- ifelse(predictions_df$p1 >= optimal_threshold, 1, 0)

# View predictions
head(predictions_df)
  predict        p0          p1 custom_prediction
1       0 0.9946160 0.005383970                 0
2       1 0.6541355 0.345864512                 0
3       0 0.9831648 0.016835232                 0
4       0 0.9907014 0.009298568                 0
5       0 0.9589116 0.041088399                 0
6       0 0.9678238 0.032176184                 0
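To see how the accuracy-optimal cutoff changes the classification, the adjusted labels can be cross-tabulated against the actual ones; a small sketch in base R:

# Compare threshold-adjusted predictions with the true labels
actuals <- as.data.frame(valid_data)[[y]]
table(actual = actuals, predicted = predictions_df$custom_prediction)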

Check the class distribution in the H2O training frame

# Original data
table(as.data.frame(data)[[y]])

   0    1 
6065  925 
# Training frame used by the leader (with balance_classes = FALSE this is simply the 80% split; it would reflect oversampling if balancing were enabled)
balanced_data <- h2o.getFrame(aml2@leader@parameters$training_frame)
table(as.data.frame(balanced_data)[[y]])

   0    1 
4907  712 

Note: For small datasets, balancing might not be possible or effective.

View base models included in any particular stacked ensemble

se_model <- h2o.getModel(best_model@model_id)
print(se_model@model$base_models)
 [1] "GBM_2_AutoML_2_20241003_100055"             
 [2] "GBM_3_AutoML_2_20241003_100055"             
 [3] "GBM_1_AutoML_2_20241003_100055"             
 [4] "GBM_5_AutoML_2_20241003_100055"             
 [5] "XRT_1_AutoML_2_20241003_100055"             
 [6] "DRF_1_AutoML_2_20241003_100055"             
 [7] "GBM_grid_1_AutoML_2_20241003_100055_model_1"
 [8] "GBM_4_AutoML_2_20241003_100055"             
 [9] "GLM_1_AutoML_2_20241003_100055"             
[10] "DeepLearning_1_AutoML_2_20241003_100055"    
# View details of a base model:
# full_lb2
# se_model <- h2o.getModel(full_lb2[1, "model_id"])  # Replace 1 with the row index for stacked ensemble for which you want to view the base models
# base_model <- h2o.getModel(se_model@model$base_models[[5]])  # Access the fifth base model
# print(base_model)

Save model object to disk

# h2o.saveModel(
# object = base_model,
# path = "D:/My Documents/AutoML_R_h2o/",
# force = T,
# export_cross_validation_predictions = FALSE,
# filename = "RandomForest_basemodel"
# )

Load model object from disk

loadtest <- h2o.loadModel("D:/My Documents/AutoML_R_h2o/RandomForest_basemodel")
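The loaded object behaves like any other H2O model, so it can be used for scoring right away; a brief sketch against the validation frame:

# Score new data with the reloaded model and check its performance
loaded_pred <- h2o.predict(loadtest, newdata = valid_data)
h2o.performance(loadtest, newdata = valid_data)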

Benefits of using AutoML:

  • Automates most of the steps involved in machine learning, reducing time and effort.

  • Requires less manual intervention, minimal coding, and little machine learning knowledge.

  • Often improves performance over a single manually tuned model.

  • Selects the best model based on a performance metric.

  • Creates stacked ensemble models that combine multiple models’ predictions to improve overall accuracy and robustness.

  • Supports parallel processing and scales to larger datasets.